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Abstract 

Given a mathematical model quantifying the viral 
infection of pandemic influenza HlNlpdm09-H275 
wild type (WT) and HlNlpdm09-H275Y mutant 
(MUT) strains, we describe a simple method of 
estimating the model's constant parameters using 
Monte Carlo methods. Monte Carlo parameter 
estimation methods present certain advantages 
over the bootstrapping methods previously used in 
such studies: the result comprises actual parameter 
distributions (posteriors) that can be used to 
compare different viral strains; the recovered 
parameter distributions offer an exact method to 
compute credible intervals (similar to the 
frequentist 95% parametric confidence intervals 
(Cl)), that, in turn, using a suitable analysis 
statistic, will be narrower than the ones obtained 
from bootstrapping; given an appropriate 
computational parallelization, Monte Carlo 
methods are also faster and less computationally 
intensive than bootstrapping. We fit Gaussian 
distributions to the parameter posterior 
distributions and use a two-sided 
Kolmogorov-Smirnoff test to compare the two 
strains from a parametric point of view; our 
example result shows that the two strains are 94% 
different. Furthermore, based on the obtained 
parameter values, we estimate the reproductive 
number Rq for each strain and show that the 
infectivity of the mutant strain is larger than the 
wild type strain. 

Keywords: mathematical virology; Monte Carlo; 
parameter estimation; likelihood; 
Kolmogorov-Smirnoff 


1 Introduction 

The World Health Organization (WHO) declared the 
first influenza pandemic of the 21 st century in 2009 
and described the virus (HlNlpdm09) as naturally 
resistant to adamantanes but susceptible to the neu¬ 
raminidase (NA) inhibitors oseltamivir and zanamivir 
[1, 2]. The H275Y mutation within the NA gene was 
reported to be associated with the drug resistance over 
the past three years although the overall level of re¬ 
sistance has remained relatively low The pandemic 
strain completely displaced the prior seasonal H1N1 
strain (A/Brisbane/59/2007) [1], which, in the 2008- 
2009 season, was nearly 100% resistant to oseltamivir 
[3]. It was initially thought that the mutation usually 
compromised strain fitness (30), therefore the domi¬ 
nance of an oseltamivir-resistant strain is rather sur¬ 
prising. A return to widespread oseltamivir resistance, 
with a mutated HlNlpdm09 virus, could have signifi¬ 
cant public health consequences [1]. 

The H275Y amino acid substitution of the neu¬ 
raminidase gene is the most common mutation con¬ 
ferring oseltamivir resistance in the N1 subtype of the 
influenza virus. Using a mathematical model to ana¬ 
lyze a set of in vitro experiments that allow for the 
full characterization of the viral replication cycle, [4] 
show that the primary effects of the H275Y substitu¬ 
tion on the pandemic H1N1 (HlNlpdm09) strain are 
to lengthen the mean eclipse phase of infected cells 
(from 6.6 to 9.1 h) and decrease (by 7-fold) the viral 
burst size, i.e. the total number of virions produced per 
cell; [4] also find, however, that the infectious-unit- 
to-particle ratio of the H275Y mutant strain is 12-fold 
higher than that of the oseltamivir-susceptible strain 
(0.19 versus 0.016 per RNA copy). The multicompart¬ 
ment mathematical model presented in [4] makes use 
of ordinary differential equations (ODE) to describe 
the virus as a dynamical system that undergoes dif¬ 
ferent phases of evolution; the model is characterised 
by a set of model parameters 0i that [4] estimate us¬ 
ing bootstrapping. We will use the same mathematical 
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model approach (albeit with a changed set of ODEs) 
and data set used in [4] but employ a different param¬ 
eter estimation method - we will use a set of Markov 
Chain Monte Carlo (MCMC) codes to robustly esti¬ 
mate 6i parameter values and credible intervals from 
posterior distributions obtained from the MCMC sim¬ 
ulations. Our results, combined with a minimal num¬ 
ber of prior assumptions about the data, provide a 
more precise set of values for the viral parameters Oi 
and the method could be standardized and used in the 
future as a stand-alone parameter estimation package 
for viral infectious disease modelling purposes. 

The primary aim of this work is to show the benefits 
of using MCMC parameter estimation techniques over 
the traditional bootstrapping methods presented in, 
e.g. [4]; secondary aims, connected to the primary, are 
to introduce the reader to the theoretical support of 
the analysis method, present a number of steps taken 
to optimize the MCMC analysis process (resolving 
computational issues with the numerical ODE solver, 
changing the ODE model to a more computationally- 
robust form) and to re-state and refine the results pre¬ 
sented in [4] as obtained from using a different and 
more precise analysis method. As a whole, this work 
presents an easily standardizable analysis method for 
any given biological data set that can be described by a 
mathematical model whose parameters need to be es¬ 
timated robustly and fast; the theoretical framewrok is 
mostly Bayesian - with the only exception of the use of 
a frequentist test comparing parametric distributions. 

This article is divided as follows: in Section 2 we 
will present the theoretical fundamentals that we use 
throughout - the mathematical model that describes 
the viral infection dynamics together with its param¬ 
eters are described in [4], what we will outline is the 
Bayesian support of the MCMC method and the statis¬ 
tic used for the proposed method of estimation; in 
Section 4 we describe the computational challenges 
poised by the MCMC method - corrections applied 
to the numerical integration of the model’s ODE set 
and a rewritten, more computationally-robust, ODE 
model; in Section 5 we present the results of the study 
and formulate a discussion around the main conclud¬ 
ing points, presented in the last section, Section 6. 


disease spread characteristics before [6, 4, 5] - approx¬ 
imating the virus with a dynamical system that can 
be characterised by a set of state variables and a set of 
parameters, an ODE system will quantify the relation 
between the state variables and their rate of change 
with time. For this work in particular, viral yield mea¬ 
surements may be simulated by using the following 
multicompartment ODE model, as described in [4]: 
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with Eq and Iq the values of E and I in the first 
eclipse and infectious compartments respectively and 
A E = Ei — Ei- 1 and A I = I 3 — Ij-\ and that de¬ 
scribes the infection of a population of N suscepti¬ 
ble target cells T at a rate /3 Vpfu, here Vpfu is the 
quantity of infectious virus. Newly infected cells first 
undergo an eclipse phase E of average duration te be¬ 
fore becoming infectious I and producing virus at a 
constant rate p for an average time 77; Vrna repre¬ 
sents the total (number of RNA copies/ml) virus con¬ 
centration, controlled by the virus production rate p 
(number of RNA copies/ml/h), the conversion factor 
between virus produced and virus observed by titra¬ 
tion p (no of PFU/RNA copies), the rate at which in¬ 
fectious virus lose infectivity c (virus decay rate, 1/h), 
and a rate of virus particle loss crna- Here tie and nj 
are the numbers of eclipse and respectively infectious 
compartments and the virus particle production rate 
per cell, prna is defined as from [4] 


2 Theoretical aspects 


Prna = 0.5 x 10 6 x p 


( 2 ) 


A virus undergoes a number of phases in its trajectory 
towards infection and replication: (in brief) it will at¬ 
tach and enter the host cell, inject its genetic material 
and convert the cell to become a virus producer and 
then it will exit the host and repeat the cycle [5] ; these 
phases can be modelled mathematically looking at the 
virus and host cells as two distinct populations. Math¬ 
ematical models have been applied to simulate viral 


To determine the in vitro infectivity of a particular 
strain, [4] used the infecting time 



which is the amount of time required for a single in¬ 
fectious cell to cause the latent infection of one more, 
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within a completely susceptible population. In the 
same framework we define the reproductive number 
Ro, the number of cases one case generates on average 
over the course of its infectious period, in an otherwise 
uninfected environment: 


Rq — 


ppP 

c + CrnA 


(4) 


The model in equation (1) (composed of the time- 
differential equations Xk(@i,t) = 0 with k state vari¬ 
ables Xk and constant model parameters 9i) describes 
the virus; we would like to test if this model describes a 
certain data set of discrete time points Xj(t). In doing 
so we solve the differential equations and fit the solu¬ 
tions to the data; this process makes use of minimizing 
the error function: 


SSR = J^(X 0 k f(9 i ,t j )-x j ) 2 (5) 

3 


or the sum of the squared residuals (SSR) r,j = 
X ok f(9 i ,t j ) — Xj with respect to the parameter set 
9i , where X k {tj) = Xokf(0i,tj) = qj is a solution of 
the X k (9i 1 t) = 0 differential system in the vicinity of 
the jth data point and Xq k is the state variable XV s 
initial value. We see immediately that both the nature 
of any given minimum of the SSR for a given parame¬ 
ter set and that very same recovered parameter set 9i 
will depend on the initial conditions X ok 


0SSR \ ( df(0,X ok ) \ 

dO A 1 d9 ). 


( 6 ) 


Lest to say, initial values represent themselves a field 
of its own and how they are treated is very important 
to the final analysis outcome - for brevity, we will state 
that all our initial parameter values are to be chosen 
within physically and biologically motivated intervals; 
we will update the reader as to our choice of initial 
values for each of the experiments described here. 

Our data Xj is composed of four data sets for which 
the two measured (explicit) state variables X k (tj) are 
the infectious viral load Vpfu = Vpfu (tj) and the total 
amount of virus Vr,na = VrnA (tj) ■ All other state vari¬ 
ables (number of cells T(t), number of cells in eclipse 
phase E(t) etc.) are derived from the system (1) with 
appropriate physical initial conditions e.g. To = 10 6 , 
E 0 = I 0 = D 0 = 0. The system’s fixed parameters 0, ; 
are to be estimated using a fitting-to-data algorithm 
that will sample values for from a parameter space 
with the aim of minimizing the error function (5) by 
means of (6); specifically, our set of parameters 


(7) 


The sampling process and minimization of the SSR can 
be efficiently done by using a Monte Carlo method. 
The principle behind any given Monte Carlo analysis 
[7, 8] is the Markov Chain (MC) - a memoryless chain 
that can be easily represented by a particle jumping 
through a set of consecutive states towards the state 
of lowest potential energy; in this process, the particle 
will retain only the information describing the current 
and immediately previous states in order to decide if 
the step is viable or not, all other information is for¬ 
gotten. The particle will decide to jump states only 
based on these two pieces of information: if the energy 
of the next state is lower than the previous one, the 
particle will jump states. More generally, we want to 
generate random draws from a target parametric dis¬ 
tribution r(0i). We then identify a way to construct 
a Markov Chain such that its equilibrium probabil¬ 
ity distribution is the target distribution T. If we can 
construct such a chain then we arbitrarily start from 
some point in the parameter space and iterate the MC 
many times. Eventually, the draws we generate would 
appear as if they are coming from our target distribu¬ 
tion. We then approximate the quantities of interest 
(e.g. mean) by taking the sample average of the draws 
after discarding a few initial draws (“burn-in” draws) 
which is the Monte Carlo component. There are several 
ways to construct Markov Chains (e.g., Gibbs sampler, 
Metropolis-Hastings algorithm [8]). 

In the absence of noise, given a set of MCMC “walk¬ 
ers” (an elementary particle undergoing a “random 
walk”, a physical approximation of the exploration 
of the Markov state space) exploring the parameter 
space 9i , the MCMC engine will minimize the residu¬ 
als Tj i.e. solve the ODE system X k (0i,t) = 0 given 
by (1), obtain values for the error function SSR from 
equation (5) for 9i and the walkers will converge to¬ 
wards a parametric position where the SSR has a min¬ 
imum (whether it be local or global); in terms of distri¬ 
butions, assuming no major physical noise source af¬ 
fecting the measurements in a systematic manner and 
an almost perfect mathematical model across all data 
sets, residuals r 7 are normally distributed and the SSR 
has a x 2 distribution with p degrees of freedom. Using 
the probability distribution function (PDF) of a x 2 - 
distributed variable and the empirical condition that 
the lower the SSR the better the model fit to the data, 
the likelihood in a maximum likelihood estimation case 
may be given by equation (8) 

£(SSR) = Ce -8811 / 2 (8) 

where C is a constant for a given p degrees of free¬ 
dom. By definition, equation (8) gives the likelihood 
of a walker transitioning from a given state, charac¬ 
terised by a certain parameter set, to a current state 


9% C [tj;, Tj , Tig, Tlj , /?, C, CrjnjA; P> IP] 
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characterised by a parameter set Oi and a fitting error 

SSR(fli). 

From the Bayesian perspective, there are known and 
unknown quantities: the known quantity is the data, 
represented by the number of discrete time points 
Xj(tj ) (evidence); the unknown quantity is the proba¬ 
bility P(O t \xj, z ) - the posterior distribution of param¬ 
eters Oi given evidence Xj and a model z (e.g. system 
(1)). Using Bayes’ rule, this probability function can 
be written: 


P(0i\xj,z) 


P(8j)P(xj,z\0i) 
P{Xj,Z ) 

P{xj,z\0i)P{0i) 

Jo P(xj,z\0i)P(0i)d0i 
£(SSR)P(0j) 
fo ^(SSR(0i))P(0 z )d0i 


(9) 


Assuming a flat prior i.e. P(0i) = 1, and a finite con¬ 
stant value for the integral h 0 = £(SSR(8i))P(0i)d0i , 
it follows from equation (9) 


P(8i\Xj,z) ->■ 

= e ^l =A e- ss ^ 2 ( 10 ) 

ho 


or 


3 Weighted analysis statistic 

Let’s introduce now the four independent experiments 
that we have data for (the data for this work was taken 
from [4], publicly available in the form of viral load 
time evolution curves); all four data sets are modelled 
by the same ODE system (1) that uses the same set 
of parameters 0,; as presented in [4] and in equation 
(7). These four experiments are called single-cycle viral 
load measurement (SC), multiple-cycle viral load mea¬ 
surement (PFUMC), multiple-cycle RNA load mea¬ 
surement (RNAMC) and mock yield viral load mea¬ 
surement (MY); for an in-depth description of each of 
these experiments and how they were carried, consult 
[4], it is not the purpose of this work to describe the 
experiments from a methodological point of view since 
the author has not conducted any of the experiments, 
nor he had he any involvement in the data collection 
procedure. It is, however, important to mention that, 
from a statistical point of view, some parameters are 
better recovered by fitting the model to only a cer¬ 
tain subset of the four data sets than the entire set. 
In [4] the same ODE model is applied to all four data 
sets, and it is assumed that the SSR values have sim¬ 
ilar distributions across the data sets (\ 2 with equal 
degrees of freedom p or log-normal distributions with 
equal means and variances), therefore the sum of the 
SSR values of the individual SC, PFUMC, RNAMC 
and MY experiment components was used as overall 
analysis statistic 


P(0i\xj,z) oc £(SSR) (11) 

It is easy to understand P(0i\xj,z) now, as the poste¬ 
rior distribution of parameters Oi, the very same above 
mentioned distribution T(0j) = P(0i\xj,z) in which 
the MCMC walkers will walk through i.e. which the 
MCMC analysis will repeatedly sample to obtain a set 
of values for the SSR. The walkers will aim towards 
regions of low SSR (valleys) and there they will ex¬ 
plore the parameter space in a more thorough manner 
(higher sampling rate) until they will reach an SSR 
minimum, where the sampling will cease and the chain 
will end. There is one danger to this process: certain 
walkers may have the tendency, highly dependent on 
the initial parameter values, to converge to a local min¬ 
imum that is sometimes rather far from the global min¬ 
imum. The global minimum needs to be reached since 
a local minimum can often present us with relatively 
small fitting errors (SSRs) but a completely different 
set of parameter values from the true values given by 
reaching the global minimum. This is why it is nec¬ 
essary to “aid” the MCMC walkers to step over these 
local valleys, by setting a correct analysis statistic, fol¬ 
lowing in the next section. 


SSR = SSRsc + SSRpfumc + 

+ SSRrnamc + SSRmy (12) 

In reality, due to high levels of experimental noise (that 
we do not account for here and [4] does not analyze 
either) and due to the mathematical model perform¬ 
ing in an uneven manner in describing the data sets, 
these SSR components do not have log-normal distri¬ 
butions but rather are y 2 distributed with different de¬ 
grees of freedom p. For simplicity, we will approximate 
these distributions as log-normal but we will apply 
weights to each SSR component in the following man¬ 
ner: from performing 10,000 fitting trials with param¬ 
eters randomly sampled from a 10-dimensional para¬ 
metric “box” (see Table 2) , we obtain the SSR compo¬ 
nent distributions shown in Figure 1. Fitting normal 
distributions in log-normal space, we obtain a set of 
reference numbers in Table 1. We must stress upon 
the fact that the log-normal approximation does not 
play any other role in the likelihood formulation apart 
from assigning weights to the different SSR compo¬ 
nents. From Table 1 we see an approximately twofold 
contribution from the PFUMC component to the over¬ 
all SSR sum. This is an average effect, for a given 
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SSR component 

Mean 

Std. Dev. 

sc 

136.0 

3.2 

PFUMC 

308.0 

4.0 

RNAMC 

98.0 

3.6 

MY 

12.7 

2.9 

Combined 

689.0 

3.0 


Table 1 Measures of SSR component means and variances, 
when assuming log-normal distributions, from running 10,000 
fitting trials with parameters randomly sampled from the 
parameter "box” in Table 2. 


uniformly sampled parameter box with large sampling 
boundaries, and does not reflect any local behaviour 
of the SSR components w.r.t. the additive overall SSR 
( 12 ). 


SSR components/SSR cummulative distributions 



Figure 1 SSR components distributions; last panel — additive 
overall SSR (the sum of components, equation (12)) 
distribution; normal distributions are fitted (dashed lines) and 
their means and variances are shown in Table 1. 


This effect has implications when performing a pa¬ 
rameter estimation analysis: if there are no weights 
applied on the SSR components, on average, the pa¬ 
rameter subset that optimally describes the PFUMC 
data set will always be dominant. Poorly constrained 
parameters, in the case of using an additive unweighed 
likelihood from four data sets, have a very low recover¬ 
ability for certain regions of the parameter space where 
one or more components of the likelihood can not con¬ 
strain effectively the others. In terms of error analysis, 
in these regions, the Gaussian-stationary distribution 
of the residuals (mean zero and variances equal for all 
four data sets) is not effective any more. 


Weighting the SSR components has to be done based 
on each component’s x 2 distribution parameters (num¬ 
ber of effective degrees of freedom), but because it is 
quite difficult to account for effective degrees of free¬ 
dom in order to weigh the likelihood components (de¬ 
fined as number of data points minus number of ef¬ 
fective model parameters), we could devise a simple 
method to weigh by component by re-writing the to¬ 
tal SSR: 



where a = 4 is the number of components - PFUSC, 
PFUMC, RNAMC and MY, n a is the number of points 
per data set a, and SSRo, a is a fixed threshold value 
for each of the data sets, that aids the MCMC walkers 
to move rapidly away from a parameter region that 
yields relatively low SSR values but poor parameter 
values (a “steep valley” of a local minimum). For a 
fixed SSR 0ta = 250 the new SSR compared to the 
simple additive SSR, will have the same profile for low 
values and a much stronger profile for high values, see 
Figure 2. Thus, from equation (13) the likelihood (8) 



Figure 2 New weighted SSR (from equation (13), continuous 
line) compared to standard unweighed additive SSR (from 
equation (12), dashed line), as used in previous works as 
analysis statistic. 


will be rewritten to 



By definition, equation (14) gives the likelihood of a 
walker transitioning from a given state, characterised 
by a certain parameter set, to a current state char¬ 
acterised by a parameter set 9i and a weighted and 
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thresholded fitting error SSR(0;). By weighting and 
thresholding the standard additive SSR and construct¬ 
ing likelihood (14), we will aid the MCMC walkers to 
converge faster to the low SSR region and find the 
global minimum corresponding to min(SSR). 

4 Computational aspects: optimal 
integrator settings and functional 
expression of the model 

Now that we have the four data sets, a model (1) with 
parameters (7) to be estimated, and a likelihood (14) 
function, we are ready to use an MCMC engine to 
obtain posterior distributions (11) of our parameters. 
For an MCMC engine we used emcee , presented in [9]. 
emcee is a Python-based, fully-parallelizable, MCMC 
engine that takes a likelihood expression ((14) in our 
case) and a set of initial values to sample the parameter 
posteriors, emcee is becoming a preferred tool to data 
analysis from different fields, see [10, 11] for examples 
of its practical use. 

In order to simulate the full set of MCMC production 
runs with emcee , and since at each iteration, to be 
able to compute the likelihood, the SSR function (13) 
needs to be computed, we ran a set of 20,000 fitting 
trials with parameters randomly sampled from a 10- 
dimensional “box” with dimensions listed in Table 2. 
The dimensions for this “box” were chosen based on 
phenomenology of the viral data - the “box” is large 
enough to comprise most possible parameter values hit 
by the MCMC walkers 

As a result of these test runs, we have noticed failure 
rates of 16-18% of the ODE solver, a rather significant 
fraction when performing a full production MCMC 
analysis. A few of the failed cases are shown in Figure 
3, where the dots represent the actual data points and 
the lines represent the solutions offered by the solver. 
These are consequences of system (1) being a stiff 
ODE system. In mathematics, a stiff differential equa¬ 
tion is an equation for which certain numerical meth¬ 
ods for solving it are unstable, unless the step size is 
taken to be extremely small [12]. It has been proven 
difficult to formulate a precise definition of stiffness, 
but the principal idea is that the equation includes 
some terms that can lead to rapid variations in the 
solution. While simulating the MCMC runs, we have 
noticed the failure of the employed numerical integra¬ 
tor - Scientific Python’s scipy.integrate.odeintf) - to 
give exact solutions to the ODE system and initially 
there was no simple rule or pattern to where the in¬ 
tegrator failed. These failures are grouped as follows, 
depending on the type of exit output: 

• The most frequent exit case of the integration pro¬ 
cess was associated with Python’s “math domain 


error”, a typical computing error that occurs when 
(in our case) the routine tries computing a loga¬ 
rithm of a negative real number - in this case, the 
computation of the SSR value was not completed 
due to negative values of the PFUMC viral load; 
this error would be recorded for another case as 
well: the integrator would fail due to the need for 
a larger number of integration steps (larger than 
the default maximum of 500) in this case the 
integration would be done up to a certain point 
where a steep change in the V ( t ) curve, the in¬ 
tegrator would exit and the rest of the viral load 
values would be of order 10 -300 , below the ma¬ 
chine precision of numpy.log 10; 

• On occasions, we noticed ’NaN’ values of the SSR, 
produced by positive but infinitesimally small val¬ 
ues of viral output, referring us to the previous 
case; 



Time (h) 


Figure 3 Failed integrations with scipy .integrate. odeint() 
exiting with Python's “math domain error” (solutions obtained 
using the parameter sets drawn from the “box” given in Table 
2; we notice a series of problematic behaviour types of the 
solutions — oscillations (with different amplitudes and 
frequencies) at artificially very low values of viral load 
VpFUMcM- typical cases of solver’s incapacity to reach a 
stable solution. 


The scipy.integrate.odeint(f(t),t,y0,*args) ODE 
integrator (see documentation available here http: // 
docs.scipy.org/doc/scipy/reference/generated/ 
scipy.integrate.odeint.html) uses an adaptive 
timestep Runge-Kutta 4-5 method written inside its 
Fortran solver library (see http://docs.scipy.org/ 
doc/numpy/user/install .html for included compiler 
instructions) to integrate systems of ordinary differ¬ 
ential equations. The number of time steps taken 
by the integrator varies but is capped by default at 
mxstep= 500. Also, the odeint function takes a list 
of output timesteps, but the Fortran routine is called 
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Parameter Min. value 

Max. value 

P{ l/ h ) 
/3(l/h) 

c(l/h) 

CRjVA(l/h) 

P 

Te(h) 

r/(h) 

E“ 0 (PFU/ml) 

max(yppuMc)* 

m „ 

-IQ — 3 s, max ( V RNAMC) 

10 2 x max(VpFUMc) 

in3 y m ax(V RNAMC ) 

(max(VppuMc )) 2 

0.01 

0.001 

10 -5 

3.0 

0.1 

VoPFUMC 

105 

(max( VppuMC )) 2 

1.0 

0.1 

1.0 

30.0 

60.0 

10 2 X VqpfuMC 


* max(VpFUMc) represents the maximum viral load from the PFUMC data set; 

** Vomc represents the initial value for the PFUMC viral load, parameter to be estimated as well. 

Table 2 The parametric 10-dimensional “box” from wich we randomly sampled parameters for the 20k ODE solving trials. 


once for each desired output and uses the previous 
call as the initial conditions for the next one. This can 
prove detrimental in cases where the specified time 
points where the integral should be evaluated is small 
(order 12-16, as in our data sets). The problem arises 
from the fact that, unlike more stable but slower MAT- 
LAB integrators, odeint does not use only the initial 
and final time points to generate a smooth, continu¬ 
ous integral and further on use interpolation to get the 
function values at each time point; it rather integrates 
piecewise for each segment determined by the specified 
time points series, process that is prone to breaking if 
there are too few time points to allow for continuity. 
One solution to this problem is to increase the pa¬ 
rameter mxstep - the number of integration steps - 
that is capped by default at 500. Increasing mxstep to 
5000 or 10000 would insure that even for order 12-16 
data points the integrator will solve the system for all 
the specified time points, and will not exit without 
reaching a stable solution. 

Unfortunately, increasing the maximum number of 
integration steps mxstep will affect the time it takes 
scipy.odeint to solve a system: if the system is solv¬ 
able with the default maximum mxstep < 500, by in¬ 
creasing mxstep to, say, 10000, will not increase the 
solving time, as seen from Table 3; increasing mxstep 
will only increase the integration time for systems that 
scipy .odeint can not solve for a given number of time 
points and will exit after the first one or two integra¬ 
tion points, most of the cases we are faced with. Table 
3 shows the time it took the solver to integrate 100 
times the same ODE system, with the same param¬ 
eters, same initial conditions, same initial and final 
time, only with different numbers of integration time 
points and mxsteps. Thus, using a larger mxstep than 
the system default one is a necessary option for an 
MCMC production run, but it will, in turn, increase 
the processing time. 

One could apply corrections for accuracy by chang¬ 
ing the default values for tolerances rtol and atol (in¬ 
ternal ODE solver error handling parameters, see doc¬ 
umentation at http://docs.scipy.org/doc/scipy/ 


12 time points 

Time/100 runs 

mxstep=5e2 

mxstep=le3 

mxstep=le4 

mxstep=le6 

dt= 51.8s 
dt= 169.8s 
dt= 167.9s 
dt= 221.9s 

100 time points 

Time/100 runs 

mxstep=5e2 

mxstep=le3 

mxstep=le4 

mxstep=le6 

dt= 188.6s 
dt= 188.9s 
dt= 189.9s 
dt= 206.4s 


Table 3 Time it took scipy .odeint^) to integrate 100 times the 
same ODE system, with the same parameters, initial conditions, 
initial and final time, only with different numbers of time points 
and mxstep’s. The reason why for 12 time points and 
mxstep =500 the run time is short is that the integrator exits 
after the second or third time point with no valid solution. 


ref erence/integrate . html) to orders of 10“ . The 

problem with reducing the tolerance to very small val¬ 
ues is that the solving for an otherwise exact within the 
default tolerance solution will take much longer time 
(of order 15-20 times longer than the solving without 
adjusting tolerance) - this, in light of a set of numerous 
MCMC simulations, could be a very much unwanted 
behaviour. 

There is, however, an alternative solution to the 
ODE incapacity problem: by effecting a change of 
state variable, we rewrite the ODE’s from system (1) 
in an exponential manner. Our state variables are 
T, Ei , Ij , VpFu j Urna and D with i,j the number of 
eclipse and infectious compartments respectively. Con¬ 
sider any of these real and positive defined state 
variables. We will operate a change of variable that 
will map (f) to a different real (both positive and neg¬ 
ative defined) variable Q with the following functional 
relation: 

(15) 

and this way the differential forms, by applying the 
chain rule, will be: 

d$ _ /dD\ 

d t \dn) t \dt)^ 
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With this in mind we rewrite the computational form 
of our ODE system (1) as: 


bpFU 

(3~ n i \ 

= ppe~ Vp ™ eI] 

| - (c + Crna) 

Vrna 

\i=i ) 

(j=ni \ 

= pe~ v ™ A eI ° 

CRN A 

T 

Vi=i J 

= —(3e VpFU 


Ei 

— ^gCr+VpFu— Eo) _ 


Ei 

= ke~ AE — k 


h 

= ke E " E ~ l0 — S 


ij 

1 

►-H 

<1 

1 

<D 

*-0 

II 


D 

= 5e Ini e~ D 

(17) 


with Eq and Iq the values of E and I in the first eclipse 
and infectious compartments respectively and A E = 
Ei — Ei _i and AI = Ij — Ij-\. 

The integration initial conditions are straightfor¬ 
ward from (1) and (17): natural logarithms of the 
original, non-exponential ODE system, i.e. Vp Fl j = 

1^pfu)^rna = ln(x41),T( 17 ) = 0.0, EW = 
/l 1 ') = D 1 17 1 = — n where n is a relatively large natu¬ 
ral number (30-100). The reason why E^ 11 ' 1 = j( 17 ) = 
L>( 17 ) = — n is that we need a good approximation for 
null initial conditions from E^ and I^ and these do 
not introduce errors in the solver due to too small a 
number. 

We have tested the re-written ODE system (17) 
against the one already in use (1) in [4]. Figure 4 (left) 
shows system (17) finds the exact same solution for a 
favourable best fit parameter set (the SSR was iden¬ 
tical between the two systems (1) and (17) to preci¬ 
sion order 10 -5 ); the system (17) is more computation¬ 
ally intensive and needs longer time to solve than the 
non-exponentiated system (1); it also needs a larger 
number of the maximum step value (we have obtained 
the best fit curve with mxstep > 4000). The time it 
took system (17) to integrate with best fit parameters 
was 48.9s (for 50 trials) whereas the non-exponentiated 
system (1) took 29.6s for the same number of trials. 
Figure 4 (right) shows the same comparison but with 
a set of unfavourable parameters that would induce 
the non-exponentiated system (1) to produce numeri¬ 
cally unstable solutions (hence damped oscillatory be¬ 
haviour); system (17) produces the correct solution 
and in a shorter time - 55.3s (for 50 trials) whereas 
the non-exponentiated system took 159.2s for the same 




Figure 4 (left) shows system 17 finds the exact same solution 
for a favourable best fit parameter set (the SSR was identical 
between the two systems (1) and (17) to precision order 
10 5 ); the system (17) is more computationally intensive and 
needs longer time to solve than the non-exponentiated system 
(1); it also needs a larger number of the maximum step value 
(we have obtained the best fit curve with mxstep > 4000). 
The time it took system 17 to integrate with best fit 
parameters was 48.9s (for 50 trials) whereas the 
non-exponentiated system (1) took 29.6s for the same number 
of trials, (right) shows the same comparison but with a set of 
unfavourable parameters that would induce the 
non-exponentiated system (1) to produce numerically unstable 
solutions (hence damped oscillatory behaviour); system (17) 
produces the correct solution and in a shorter time - 55.3s (for 
50 trials) whereas the non-exponentiated system took 159.2s 
for the same number of trials, with a high percentage of 
unstable solutions 


number of trials, with a high percentage of unstable 
solutions. 

We repeated the 20,000 trials with parameters ran¬ 
domly sampled from the “box” of Table 2 above but 
using the system given by equation (17) this time. All 
solutions were stable, so failure rate in computing a 
numerical value for the SSR due to non-physical neg¬ 
ative V ( t) values is 0%. 

5 Analysis and results 

In order to obtain a set of “best fit” parameters i.e. a 
set of parameters for which model (17) best describes 
the data, we performed a number of ~1500 bootstrap 
runs. Any given bootstrap run is characterised by a 
set of initial conditions (initial parameter values, ran¬ 
domly chosen from the parametric “box” (Table 2), an 
ODE system (17) and an error function that quanti¬ 
fies how well the solution fits the data, given in equa¬ 
tion (13); for the thresholds we chose values of order 
3 times larger than the “best fit” values i.e. SSR a in 
equation (13) we chose SSRsc = 10, SSRpfumc = 8, 
SSRrnamc = 6 and SSRmy = 3 (|). This ensemble 
is passed to a least squares engine that walks the sys¬ 
tem through a large combination of parameter values, 
with each iteration coming closer to the global mini¬ 
mum position in parameter space i.e. minimizing the 
SSR. We then chose the single parametric combination 
with the lowest SSR from these bootstrap runs. Such 
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a global minimum “best fit” to the data is shown in 
Figure 5 for three of our four data sets (SC, PFUMC 
and RNAMC); the parameters for this case are used 
to construct the intervals to start the MCMC walkers 
from: the “best fit” parameter set is used as the center 
of the interval with boundaries given by the mean of 
all the bootstrap parameter values. 

The bootstrapping process randomly samples the fit¬ 
ting errors with replacement; this set is usually as¬ 
sumed to be from an independent and identically dis¬ 
tributed population, and it is implemented by con¬ 
structing a number of resamples with replacement of 
the same size as the initial set. The main problem of 
using bootstrapping to estimate biological parameters 
is that the mathematical model errors are not sampled 
from the same distribution - the model is applied to 
different data sets that it describes differently; there 
is also the experimental noise that introduces fluctua¬ 
tions in the fitting errors, but this we can not model in 
this work. In our situation, using bootstrapping to es¬ 
timate parametric confidence intervals will introduce a 
bias towards the data set(s) that are worse described 
by the model and the parametres will have broader 
distributions, usually with heavy tails in the regions 
where the model constrains poorly those data sets. On 
the other hand, the MCMC analysis will not assume 
all error samples are drawn from the same distribution 
and will most often reject those parameter regions pro¬ 
ducing the bootstrap distribution tails. This bing said, 
using bootstrapping is a good way to obtain “best fit” 
parameter intervals that can be further used as initial 
values for the MCMC runs. 



The MCMC simulations have been run on multi¬ 
ple CPU’s using the OpenMPI (Open Message Pass¬ 
ing Interface http: //www. open-mpi . org/) implemen¬ 
tation in emcee package (see [9] and the guide at 
[13] for instructions on how this mode is configured, 


specifically MP4Py); the computational load has been 
divided and managed on Compute Canada’s SHAR- 
CNET academic and scientific-usage computer clus¬ 
ters (https://www.sharcnet.ca/). The analysis pro¬ 
cedure comprised the following specifications: 

• According to emcee 's user manual [13] it is desired 
to start each of the walkers from a favourable po¬ 
sition in the parameter space - as such, we started 
all the walkers on a tight ball centered at the “best 
fit” parameters of each of the four experiments. 
The “best fit” parameters are seen in Table 4 to¬ 
gether with the walkers’ start intervals, labelled 
as W columns; 

• We ran 2000 iterations for each of the 100 walkers 
per combined data sets (SC, PFUMC, RNAMC 
and MY) with 200 iterations for burn-in; sam¬ 
pling was performed in linear parameter space. 
The emcee package needs just a few inputs to be 
able to handle the MCMC simulations, the most 
important input being the an expression for the 
likelihood of accepting or rejecting the parameter 
set of any given walker’s position in the parame¬ 
ter space. This likelihood is given by equation (14) 
and is directly coded in, making use of the SSR 
thresholds (|); 

• Chains had average acceptance rates of 30-40%; 
convergence and chain mixing was checked using 
a Gelman-Rubin diagnostics test and chains that 
did not reach convergence were discarded (the 
Gelman-Rubin test [14] is used to check the con¬ 
vergence of multiple MCMC chains run in paral¬ 
lel; it compares the within-chain variance to the 
between-chain variance); 

• Only parameter sets from walkers with SSRj5.0 
were kept for final parameter estimation, the rest 
of the sets being discarded. 

The numerical results are presented in Table 4; the 
listed parameter value ^ is the mean of posterior dis¬ 
tributions obtained from the MCMC simulations; the 
credible interval is obtained from fitting a Gaussian 
distribution to the posterior distribution, see Figure 6. 
The Gaussian fit agrees very well with the actual pos¬ 
terior distributions obtained from the MCMC simula¬ 
tions. 

In order to quantify the difference between the two 
strains, WT-H275 and MUT-H275Y, from a paramet¬ 
ric point of view, we construct the distance in param¬ 
eter space r 



where N is the number of estimated parameters, KSi 
is the result of a two-sided Kolmogorov-Smirnoff test 
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WT-H275 and MUT-H275Y data: emcee MCMC Simulations Results for ODE Model (17) 

- initial value interval; ( 2 ) - obtained median value; ( 3 ) - obtained credible interval from posterior distribution 


Parameter 

WWT-H275 

( 2 )WT-H275 

(3)WT-H275 

WMUT-H275Y 

( 2 )MUT-H275Y 

( 3 >MUT-H275Y 

Mean eclipse period, te (h) 

6.6±3.0 

6.6 

6.3-7.1 

9.1±3.0 

9.3 

9.1-9.6 

Eclipse period SD, cte (h) 

1.2±2.0 

1.3 

1.1-1.5 

1.6±2.0 

1.6 

1.5-1.8 

Infecting time, f in f ec t 1 (min) 

31.0±15.0 

22.2 

19.7-23.8 

22.0±15.0 

18.6 

17.8-19.8 

Infectious life span, tj (h) 

49.0±15.0 

35.2 

34.6-36.2 

41.0±15.0 

32.0 

31.4-32.3 

Depletion rate, c (h — - 1 ) 

0.1±0.05 

0.078 

0.069-0.086 

0.1±0.05 

0.087 

0.076-0.097 

Production rate/cell, Prna (RNA copies/h) 

2,200±1,000 

3,355 

3,071-3,700 

370±1,000 

459 

425-498 

Viral burst size, b 2 (RNA copies) 

110±50 

118 

109-131 

15±50 

15 

13-16 

Reproductive number, Rq 3 

2000:11500 

3582 

3101-4085 

3000:11500 

4243 

3906-4572 


1 Computed using equation (3); 2 Burst size b = prna x tj; 3 Computed using equation (4). 


Table 4 Viral infection parameters obtained from MCMC simulations with the emcee engine: initial values, median values, credible 
intervals obtained from posterior distributions for both the wild type WT-H275 and the mutant MUT-H275Y HlNlpdm09 strains. 



Mean infectious phase times for WT-H275 and MUT-H275Y 




tau E ( h ) 

Mean infection times for WT-H275 and MUT-H275Y 



4,nW 


Figure 6 Posterior distributions for four viral parameters: production rate/cell, prna (RNA copies/h, top left corner), mean eclipse 
period, Tg; (h, top right corner), infectious life span, 77 (h, bottom left corner) and infecting time, ti n f ec t (min, bottom right corner) 
for both WT-H275 and MUT-H275Y strain data; these distributions have been obtained by running a set of MCMC simulations 
(emcee engine, 100 walkers, 2000 iterations) and Gaussian distributions have been fitted over (dashed lines) — the parameters of the 
fitted distributions are listed in Table 4. 


(see documentation at http: //docs . scipy. org/doc/ 
scipy-0.14.0/reference/generated/scipy.stats. 
ks_2samp.html) with inputs the two ith parameter 
distributions for WT-H275 and MUT-H275Y (the null 
distribution of this statistic is calculated under the 
null hypothesis that the samples are drawn from the 
same distribution; this test is used to determine how 


“similar” two parametric distributions are, assuming 
they are normal distributions; the choice of use of a 
frequentist test over a Byesian one is that there al¬ 
ready exists a scipy simple-to-use package that can 
be called with ease when constructing a parameter- 
estimation package) and Wi is a set of weights that 
are chosen function of the recoverability of ith param- 
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eter (for simplicity we use Wi = 1 here). We apply a 
two-sided Kolmogorov-Smirnoff test using the fitted 
Gaussian distributions. The A'S) test results are listed 
in Table 5. The statistic Ar is a percentage that tells 
us how different the two compared strains are, in our 
case, from the MCMC estimated four parameters in 
Table 5, the two strains are 94% different. Of course, 
this rationale could be extended for a larger set of pa¬ 
rameters, whichever are deemed as significant in the 
case of comparison, here we are using equation (18) 
with four parameters as an example only. 

WT-H275 and MUT-H275Y data: Kolmogorov-Smirnoff test 
results 


From equation (18), Ar = 0.94% 


Parameter 

KSi (%) 

Mean eclipse period, te 

0.99 

Infecting time, tinfect 

0.79 

Infectious life span, rj 

0.96 

Production rate/cell, Prna 

1.00 


Table 5 Kolmogorov-Smirnoff test results for four (example) 
parameters. 


6 Conclusions 

We used the data from [4] in a bid to refine the 
results presented there and formulate new concepts 
to analytically compare virus strains; we used a dif¬ 
ferent analysis method to estimate the viral dynam¬ 
ics parameters of model (1) - Markov Chain Monte 
Carlo simulations. We modified the model to a more 
computationally-robust model (17), formulated a like¬ 
lihood (14) and used an existing MCMC computer 
package (emcee presented in [9]) to obtain parame¬ 
ter distributions (posteriors). The MCMC results offer 
much narrower credible intervals than bootstrapping 
95% confidence intervals (see [4] for a result using 
bootstrap replicates); they also offer true parameter 
probability distribution functions (PDFs) in the form 
of the posterior distributions: full parameter results 
can be found in Table 4 and distributions are shown in 
Figure 6. We used these distributions and fitted Gaus¬ 
sian distributions to extract the credible intervals, the 
Gaussian fit being rather good. In a novel approach, we 
used a frequentist two-sided Kolmogorov-Smirnoff test 
to compare the two viral strains, comparison yielding 
a 94% difference from a parametric point of view. By 
computing the reproductive number Rq we show that 
the infectivity of the mutant strain is higher than the 
wid type; this has been shown in [4] as well but our 
values for Rq differ and suggest a smaller difference in 
infectivity between the two strains; this is supported 
by narrow credible intervals for i?o, see 4. 


Assuming a flat prior, the MCMC analysis method 
we present here does not make use of any assump¬ 
tions with regards to data, and, albeit the data is 
very small (order 60 data points in total), the re¬ 
sults are not only consistent with the ones presented 
in [4] but parametric credible intervals are narrower 
and we could obtain an analytic measure to distin¬ 
guish the two HlNlpdm09 strains. The use of emcee 
[9] is very easy and, if correctly configured in multi¬ 
processor mode, the program runs very fast - we could 
obtain the results we present here in a matter of one 
day. If this analysis method is correctly packaged, one 
could obtain this type of results in a matter of hours. 
Using a generic mathematical model and this analysis 
package could lead to results being obtained very fast, 
matter that might be very useful in the future in the 
case of a major pandemic. 
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